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Abstract: Chemical compounds are usually represented as graph structured data in computers. In this review article, we overview 
several graph classes relevant to chemical compounds and the computational complexities of several fundamental problems for 
these graph classes. In particular, we consider the following problems: determining whether two chemical graphs are identical, 
determining whether one input chemical graph is a part of the other input chemical graph, finding a maximum common part of 
two input graphs, finding a reaction atom mapping, enumerating possible chemical graphs, and enumerating stereoisomers. We also 
discuss the relationship between the fifth problem and kernel functions for chemical compounds. 
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Introduction 

In computer analysis of chemical compounds, chemical structures 
are usually represented as graph structured data. Mathematically, a 
graph consists of a set of vertices and a set of edges, where a vertex 
represents some object and an edge represents a relation between two 
objects. From a chemical viewpoint, a graph corresponds to a 
chemical structural formula, in which a vertex and an edge correspond 
to an atom and a chemical bond, respectively. Furthermore, an atom 
type and a bond type are represented by labels of a vertex and an edge, 
respectively. In this article, graphs with such labels are called chemical 
graphs. 

Graphs are a very important concept in computer science, and 
extensive studies have been done to develop efficient algorithms for a 
number of graph problems. Among these problems, we focus on 
fundamental problems relevant to chemical graphs. In particular, we 
consider comparison and enumeration of chemical graphs because 
these are fundamental and have a long history in chemoinformatics. 
For example, unique naming of chemical compounds and enumeration 
of isomers have been studied for more than 100 years [I], much 
before the invention of computers. In this article, we consider the 
following problems: 

(i) determining whether two chemical graphs are identical, 

(ii) determining whether one input chemical graph is a part of the 
other input chemical graph, 

(iii) finding a maximum common part of two input graphs, 

(iv) finding a reaction atom mapping, 

(v) enumerating possible chemical graphs, 

(vi) enumerating stereoisomers. 
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where (i) and (v) are closely related to unique naming and 
enumeration of structural isomers, respectively. We do not intend to 
provide a comprehensive review on these problems because there are 
too many methods even for any of these six problems. Instead, we try 
to clarify the computational complexities of them. We also introduce 
some recent developments on problems (v) and (vi) with focusing on 
our recent work because our algorithms are based on somewhat 
different approaches than traditional approaches in chemoinformatics 
[2] and they have guaranteed computational complexities. Since this 
review article focuses on time complexity aspects of the problems and 
algorithms, readers interested in practical and heuristic methods in 
chemoinformatics are referred to existing books and review articles: 
[2] for fundamental algorithms, [3,4] for pattern matching 
algorithms, [2,5,6] for prediction and regression methods, and [2,7] 
for enumeration algorithms. 

As discussed later, most of the above problems are intractable for 
general graphs from a viewpoint of computational complexity. 
However, chemical graphs have several restrictions. For example, the 
maximum number of bonds connecting to an atom is usually less than 
8. Making use of these constraints, it is often possible to develop 
theoretically efficient algorithms. Therefore, before discussing 
individual problems, we briefly review graph classes that are relevant 
to chemoinformatics. 

The organization of this article is as follows. First, we review 
graph classes relevant to chemoinformatics and give a brief 
introduction of computational complexity. Next, we review 
theoretical results and some algorithms on problems (i)-(iv). Next, we 
describe a relationship between kernel methods and enumeration 
problems, where kernel methods are a kind of machine learning 
method and have been applied to various chemoinformatics problems. 
Then, we review our recent algorithms for problems (v) and (vi) 
because they are based on state-of-the-art techniques in graph 
algorithms and thus may bring new methodologies into 
chemoinformatics. Finally, we conclude with future work. For the 
purpose of simplicity of presentation, we do not give formal 
definitions, instead explain terms and results by using words and 
figures. 
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Graphs and chemical compounds 

Chemical graphs 

A graph consists of a set of vertices and a set of edges, where a 
vertex and an edge correspond to an atom and a chemical bond, 
respectively. Each graph is denoted as G( Y,E) where Y denotes a set 
of vertices and E denotes a set of edges. There are two kinds of 
graphs: directed graphs and undirected graphs. Each edge has a 
direction in directed graphs, whereas no edge has a direction in 
undirected graphs. Since there is usually no explicit direction in 
chemical bonds, we only consider undirected graphs in this article and 
thus each edge is represented by a set of two vertices (i.e., two atoms 
connected by the corresponding chemical bond). 

In order to associate chemical structures to graphs, we employ 
labels of vertices and edges. Each vertex v has a label /(f), which 
represents an atom type (e.g., /(f)='C' if v corresponds to a carbon 
atom). In this article, a graph with vertex and edge labels defined as 
above is called a chemical graph. There are two ways to represent a 
chemical bond with multiplicity: 

(i) multiplicity is represented by multi-edges (e.g., double bond is 
represented by two edges), 

(ii) multiplicity is represented by a label of an edge (e.g., /(e)=2 if an 
edge e corresponds a double bond). 

We mainly consider the latter way of representing chemical bonds 
in this article, where /(e)=1.5 may represent an aromatic bond. The 
degree of a vertex is defined as the number of edges connecting to it, 
and is closely related to the valence of an atom. A graph with a 
designated vertex r is called a graph rooted at a vertex r. Isomorphism 
between two rooted graphs assumes that the roots of the two graphs 
correspond each other. In this paper, we utilize a fast algorithm 
designed for enumerating rooted trees. However, we designate as the 
root of a tree a special vertex of the tree which is uniquely determined 
by the topological structure only, and thereby our algorithms 
effectively enumerate ''unrooted'' trees. 

Graph classes 

As mentioned above, chemical structures can be represented as 
graphs. However, we need not consider all kinds of graphs. For 
example, it is known that most atoms have valence at most 8, which 
implies that the maximum degree of chemical graphs is at most 8. 
Therefore, it is enough for chemical structures to consider graphs 
with bounded degree. In what follows, we only consider bounded 
degree graphs. 

As discussed later, many graph problems can be solved much 
faster if we restrict types of graphs. Therefore, we review here several 
graph classes that are relevant to chemical applications (see Figure I). 
For details of graph classes, see [8]. 

Tree: A graph is called a tree if it is connected and does not have a 
loop, where 'connected' means that there exists a path (a sequence of 
connected edges) connecting any pair of vertices. Trees are one of the 
simplest graphs and many problems can be solved much more 
efficiently for trees than for general graphs. 

Outerplanar graph: A graph is called outerplanar if it can be 
drawn on a plane in such a way that all vertices lie on the outer face 
without crossing of edges, where the outer face is the unbounded 
exterior region. Trees are a subclass of outerplanar graphs. 

Almost tree: A graph is called an almost tree (with parameter if 
each biconnected component (i.e., maximal non-tree part) is obtained 
by adding at most k edges to a tree. Trees are a subclass of almost 



trees (i.e., ^=0). However, outerplanar graphs are not a subclass of 
almost trees or almost trees are not a subclass of outerplanar graphs. 

Partial ^-tree: A graph is called a partial k-tree if it is transformed 
into a tree by regarding a family of subsets of vertices as a set of new 
vertices (i.e., by tree decomposition), where each subset consists of at 
most kW vertices (Figure 2). Trees, outerplanar graphs, and almost 
trees with parameter are subclasses of partial I -trees, partial 2-trees, 
and partial iH" I -trees, respectively [9]. 




Figure 1. Examples of (a) tree, (b) almost tree, and (c) outerplanar graph, 
where k-2 in (b). 

Yamaguchi et al. studied the distribution of partial ^-trees in 
chemical graphs [10]. Horvath and Ramon also studied the 
distribution of partial i:-trees in some dataset and reported that 
8.77%, 97.35% and 99.97% of compounds are partial I -trees, 2- 
trees, and 3 -trees, respectively and most partial 2-tree compounds are 
outerplanar [II]. 

Computational complexity 

Here, we briefly review some basic concepts in computational 
complexity. If the computation time of an algorithm is proportional 
to ri^ (i.e., the computation time is C^z?'^) for some constant d where 
n denotes the size of an input data, the algorithm is said to be a 
polynomial- time algorithm. There exist many problems that do not 
have polynomial-time algorithms. Although we do not explain details, 
MP-hard problems are widely believed not to have polynomial-time 
algorithms. In theoretical computer science, polynomial-time 
algorithms are regarded as efficient algorithms whereas NP-hard 
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problems are regarded as intractable problems. However, NP-hardness 
does not necessarily mean practical inefficiency. In particular, the 
number of vertices in a chemical graph is usually less than 100. 
Therefore, there is room for development of practically efficient 
algorithms for chemical compounds even if the problems are NP- 
hard. 



tree decomposition 
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Subgraph isomorphism 

Another important problem is to decide whether some chemical 
graph is included as a part of another chemical graph. Deciding 
whether a benzene ring is contained in a given chemical graph is an 
example of this problem. This problem is called the subgraph 
isomorphism problem in graph theory. Although it is known that the 
subgraph isomorphism problem is NP-hard even for graphs of the 
maximum degree 3, a polynomial-time algorithm is known for partial 
^-trees of bounded degree [15,16]. Therefore, the subgraph 
isomorphism problem can be solved in polynomial time for almost all 
chemical graphs. Unfortunately, the algorithms in [15,16] are not 
practical. 

For practical applications to chemical compounds, subgraph 
isomorphim algorithms based on maximum clique or a branch-and- 
bound method [17] have been widely developed and utilized (see for 
example, [18] for the former approach and [19] for the latter one). 
Although they are not guaranteed to work in polynomial time, they 
work fast in practice. It is also important to search graphs containing 
subgraphs isomorphic to a given query graph in a chemical database 
[18,19]. In such an application, filtering non-relevant compounds is 
quite useful and thus various features have been proposed and utilized 
(see for example, [19]). 



Figure 2. Examples of partial /r-tree for /f=3. The right figure shows a tree 
decomposition of G{V,E). 

Comparison of chemical graphs 

Comparison of graph structured data is fundamental and 
important for chemoinformatics and pattern recognition. Indeed, 
extensive studies have been done to develop practical algorithms for 
that purpose [2-4]. In computer science, extensive theoretical studies 
have also been done for comparison of graphs. However, it seems that 
these theoretical results are not well-known in chemoinformatics. 
Although most of theoretical algorithms are not efficient in practice, 
they might give some hints for development of practically efficient 
algorithms. In this section, we mainly review theoretical results on 
graph isomorphism, subgraph isomorphism, maximum common 
subgraphs, and reaction atom mapping, all for chemical graphs 
(Figure 3 and Figure 4). 

Isomorphism and signature 

The most fundamental problem for comparison of chemical 
graphs is to test whether given two graphs are identical (i.e., 
isomorphic). Although it is unknown whether there exists a 
polynomial-time algorithm for general graphs, a polynomial-time 
algorithm is known for graphs of bounded degree [12], which means 
that isomorphism of two chemical graphs can be tested in polynomial 
time. However, this algorithm is not practical because it is based on 
group theory and the degree of polynomial in the time complexity is 
high. 

It is also important to give a normal form of a given graph, which 
is equivalent to giving a unique name to a graph. Indeed, huge efforts 
have been paid in chemistry to define rules for giving unique names to 
chemical structures (e.g., lUPAC nomenclature). In computer science, 
a group -theoretic polynomial-time algorithm is also known for the 
normal form problem for graphs of bounded degree [13] although it 
is not practical. Faulon gave detailed discussions on isomorphism and 
normal forms for chemical graphs and presented polynomial-time 
algorithms for chemical graphs with planar structures [14]. 




Figure 3. Comparison of graphs. Graph (a) is isomorphic to graph (b). 
Graph (c) is subgraph isomorphic to graph (a). Graph (f) is a maximum 
common subgraph between graphs (d) and (e). 



Maximum common subgraph 

It is also important to find a common part of two or more 
chemical graphs. Among several possible formulations, the most 
fundamental one is the maximum common subgraph problem 
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(precisely, the maximum common connected edge subgraph problem) 
for two graphs [3]. Since a polynomial-time algorithm was developed 
for almost trees of bounded degree [20], there had been almost no 
significant progress [10] until recently from a viewpoint of the 
computational complexity. However, Akutsu and Tamura recently 
developed a polynomial-time algorithm for outerplanar graphs of 
bounded degree [21]. On the other hand, they also showed that the 
problem remains NP-hard for partial ^-trees of bounded degree with 
1 = 11 [22]. 

For general graphs, some exponential time algorithms have been 
developed. Huang et al. presented an CXjrl'IrF) time algorithm and 
showed an Q{^fJrL)if^^^~^ time lower bound under some reasonable 
assumption on complexity class where n is the number of vertices of a 
larger input graph, h is the size of the maximum common subtree, and 
f \s> any recursive function [23]. However, we could not confirm 
CXri'lf) time complexity and speculate that it should be CXjn^^lF), 
Abu-Khzam et al. developed an C^I.274'"+3''/^(/?+I)'') time algorithm 
where c is the size of the minimum vertex cover of a smaller input 
graph [24]. 




(>A2,e2) (^3,B3) 



Figure 4. Reaction atom mapping. In this case, there exist three possible 
reaction atom mappings, where (/4i,8i) is the most plausible. By deleting 
the edges (bonds) crossing to curves of each type, we can obtain 
isomorphic graphs between the left hand side and right hand side of the 
reaction, which also gives a sequence of graph editing operations. 

Reaction atom mapping 

In addition to comparison of chemical graphs, there exists another 
important pattern matching problem for graphs: finding a minimum 
cost sequence of graph editing operations that transforms one input 
graph into the other input graph [25]. There exist several variants of 
the problem depending on the definition of editing operations [26]. 
There also exists a close relationship between the minimum graph edit 
problem and the maximum common subgraph problem, both of 
which are known to be NP-hard in general [25]. Since minimum 
graph edit is too wide, we focus on the reaction atom mapping 
problem (Figure 4). It is a problem of finding an optimum 
correspondence (or enumerating all possible mappings) between 
atoms before and after a reaction. It is also defined as a problem of 
finding a minimum cost sequence of deletions and additions of edges 
that transforms one input graph into the other input graph where 
input graphs can be disconnected and the set of atoms must be 
preserved before and after the edit sequence. Consider a simple 
reaction of the following type 

X-A + Y-B ^ X-B + Y-A 



where X, Y, A, B are chemical species. Then, by deleting two edges 
(edges between X and A , and between Y and B) from the left hand 
side of the reaction and inserting two edges (edges between X and B, 
and between Y and A), we have the right hand side of the reaction. It 
also gives a mapping between atoms before and after the reaction. 
However, mapping or edit sequence may not be determined uniquely 
as shown in Figure 4. In such a case, it may be required to enumerate 
all possible mappings, or to find a chemically optimal mapping. Since 
there exist several formulations depending on applications, we do not 
give precise definitions here. This problem has several applications 
including consistency check of chemical reactions in a database and in 
silico tracer experiments. 

Arita developed a heuristic method for the reaction atom mapping 
problem based on maximum common subgraph (MCS) [27]. Hattoti 
et al. also applied their own maximum common subgraph (MCS) 
algorithm to find reaction atom mappings [28]. However, Arita 
pointed out that MCS approaches sometimes fail to find desired 
mappings [27]. Akutsu firstly gave a mathematical formalization of 
the problem, proved NP-hardness of the problem, and developed an 
algorithm based on unique graph naming and exhaustive examination 
of cutting edges [29]. Crab tree and Mehta developed faster 
algorithms based on unique graph naming and efficient combinatorial 
search for cutting edges [30]. Heinonen et al. also developed a fast 
algorithm based on A"^ search [31], a well-known efficient searching 
technique in artificial intelligence. Zhou and Nakhleh developed a 
method to find all symmetries in both a chemical compound and a 
chemical reaction, where the latter can be used to find reaction atom 
mappings with stereo chemical information [32]. 

Kernel methods and pre-image problem 

Recently, kernel methods, which include support vector machines 
(SVMs), have become one of the standard tools in machine learning. 
Kernel methods have also been extensively applied to Quantitative 
Structure-Activity Relationship (QSAR) and Quantitative Structure- 
Property Relationship (QSPR) problems [2,5,6] whose purposes are 
to predict the chemical activity and property for a given chemical 
compound, respectively. For example, various kernel functions for 
QSAR/ QSPR have been developed based on alignment of two 
chemical graphs [33,34], three-dimensional superposition [35], 
Tanimoto and other coefficients [36], molecular descriptors [37], and 
subtree patterns [38]. 

In order to apply kernel methods to chemical structures, it is 
usually required to map a chemical graph to a feature vector in a 
feature space (i.e., a vector in high-dimensional Euclidean space or 
infinite-dimensional Hilbert space) because a kernel function is 
defined as an inner-product between two feature vectors. Although 
various methods have been proposed for design of feature vectors for 
chemical graphs, those based on frequency of small fragments [39-41] 
and frequency of labeled paths [42,43] have been widely used, where 
weights/ probabilities are sometimes put on paths/ fragments. 

For example, consider chemical compounds consisting of atoms 
of type C, N, O, H. Then, there are 4 kinds of labeled paths of length 
0 (i.e., C, N, O, H), 16 kmds of labeled paths of length I (i.e., C-C, 
C-N, ...), 64 kinds of labeled paths of length 2, and so on. Figure 5 
shows an example of a feature vector based on frequency of labeled 
paths (precisely, a feature vector based on the numbers of occurrences 
of labeled paths). For a chemical graph shown in Figure 5, the 
numbers of atoms of type C, N, O, H are 5, 1,2, 9, respectively, and 
thus the coordinate values corresponding to paths of length 0 are 5, I, 
2, 9. For the same graph, the number of C-N bonds is 2 and thus the 
corresponding coordinate value is 2. For C-C bond, the coordinate 
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value is 8 because each C-C bond is counted twice as two paths for 
opposite directions. The maximum length of paths of a feature vector 
is called the level the feature vector. Therefore, the vector in Figure 
5 is a feature vector of level I. 
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Figure 5. Feature vector and pre-image. 

In chemo informatics, inverse QSAR/QSPR is becoming 
important because it might lead to design of new chemical 
compounds and new drugs [44-46]. As closely related concept, the 
pre-image problem has been studied in machine learning [47,48]. In 
the pre-image problem, a desired object is specified or computed as a 
vector in a feature space using a suitable objective function and then 
the vector is mapped back to the input space, where this mapped back 
object is called a pre-image. Then, it is expected that this pre-image 
has a desired property (e.g., chemical activity). Akutsu et al. showed 
that the graph pre-image problem based on path frequency is solved 
in polynomial time of the number of atoms if the graphs are trees 
whose maximum degree is bounded by a constant and the lengths of 
given paths and the number of atom types are bounded by constants 
[49], which was further extended to outerplanar graphs with some 
constraints [50]. However, this algorithm is not practical due to its 
high degree of polynomial. Nagamochi proved that the graph pre- 
image problem can be solved in polynomial time for both tree and 
general graphs if the level of feature vectors is I [51]. Although the 
corresponding algorithm is efficient in practice, it can only be directly 
applied to feature vectors of level I . Furthermore, it is difficult for the 
above mentioned algorithms to cope with various constraints. 
Therefore, it is useful to develop algorithms for the pre-image 
problem based on enumeration of chemical graphs, which is explained 
in the next section (see [52] for more detailed review). 

It is to be noted that extensive studies have been done on inverse 
QSAR/QSPR, where the descriptors correspond to feature vectors in 
the pre-image problem. Kier et al. developed methods for 
reconstructing molecular structures from the count of paths of a 
length up to two and three by combining enumeration with bounding 
operations [44]. Skvortsova et al. developed a similar method where 
paths of the same length are further classified into several classes sing 
atom and bond information [45]. Faulon et al. defined a descriptor 
based on trees and developed methods to enumerate all the structures 
consistent with a given descriptor [46]. Wale et al. compared various 
descriptors including ECFP descriptors, not for inverse 
QSAR/ QSPR but for compound retrieval and classification [53]. 



Enumeration of chemical graphs and stereoisomers 

The enumeration of chemical structures has a long history 
beginning from the work by Cayley in the 1 9th century [I]. 
Enumeration of chemical structures has many applications in 
chemistry, which include structure determination using mass-spectrum 
and/ or NMR-spectrum, virtual exploration of chemical universe, 
reconstruction of molecular structures from their signatures, and 
classification of chemical compounds. 

Some useful tools such as MOLGEN have been developed for 
enumeration of chemical graphs [2,54]. Although MOLGEN is very 
efficient, it is worthy to examine other approaches because extensive 
studies have been done on graph enumeration in the field of 
theoretical computer science and data mining [55,56]. In particular, it 
might be possible to develop faster algorithms for the enumeration 
and pre-image problems if we restrict the class of target chemical 
graphs and employ state-of-the-art techniques for enumeration of 
graph structures. Based on this idea, we have been developing several 
algorithms for enumeration of chemical graphs (i.e., structural 
isomers) and stereoisomers. Although our enumeration algorithms for 
structural isomers may not yet be faster in practice than MOLGEN, 
our algorithms have some guaranteed time complexities. In addition, 
our enumeration algorithms for stereoisomers are quite fast for 
counting: they work in optimal linear time and are very fast in 
practice. It is to be noted that the number of isomers grows 
exponentially to the number of atoms in general and thus it is 
impossible to output all isomers in polynomial time. Therefore, the 
purpose of development of fast enumeration algorithms is to reduce 
the computational complexity required per isomer. In this section, we 
explain key ideas used in these algorithms. Since some details are a bit 
involved, readers not interested in algorithmic details can skip such 
parts. 

Enumeration of chemical graphs 

Given a feature vector /'that specifies the frequency of each path 
of length at most K>0, our aim is to enumerate all tree-like 
chemical graphs whose path frequency for all paths of length up to K 
is identical with £ A chemical graph is called tree-like if it becomes a 
simple tree (an diZyoSiz graph) by replacing the multiple edges (bonds) 
between every two vertices with a single edge. 

Ishida et al. [57] developed an efficient algorithm for enumerating 
all tree-like chemical graphs that satisfy a given single feature vector £ 
The algorithm consists of two major phases, each of which is designed 
based on a hranch-and-hound method (see [17] for details), one of 
the standard enumerative approaches for solving combinatorial 
problems. In the first phase, we first design a branching operation to 
generate all simple trees that satisfy only the frequency of path of 
length 0 in the vector /"(i.e., a set K of vertices is specified by /), 
ignoring any multiplicity prescribed in £ Starting with the empty 
graph, the operation appends a new vertex to the current tree until the 
tree has the vertex set K A tree obtained from the current tree T 
by appending a new vertex is called a child of 77 where 7" is called the 
parent of T\ Since the current tree T may have more than one 
adequate position to which a new vertex is appended (i.e., 7"may have 
more than one child), the parent-child relationship forms a tree 
rooted at the empty graph, called a £a.mily tree. Each node v with 
depth k in the family tree represents a tree l\v) k vertices, and the 
trees of the descendants of ifl will be generated from 7^f) by repeated 
applications of the branching operation. 

Figure 6 (a) shows part of a family tree for generating simple trees 
with a specified vertex set Y of four carbons, one oxygen and five 
hydrogens, where the parent of each simple tree 7T is 7T^i. A family 
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tree is determined by how to define parents of trees, which also 
defines the children and affects the computational efficiency. Using 
the fastest tree enumeration [58], we can generate a child from the 
current tree in a constant time. 

The branching operation is rather a straightforward enumeration 
which can output each tree efficiently, but may generate trees that do 
not satisfy some of the required conditions of the problem such as the 
frequency of path of length > I . In a branch-and-bound method, we 
also incorporate into the process of executing a branching operation, 
another procedure, called a bounding operation, to discard part of the 
process which produces only candidates that do not lead to any 
solutions. In our problem, we apply several quick tests to each node v 
in the family tree to try to know whether there is a descendant w of v 
such that TJ^w) is a solution; i.e., the tree T^f) can be extended to a 
simple tree on the specified vertex set K without violating any of the 
required conditions. Our bounding operation tests the following 
criteria and skips the process of appending vertices to l\v) if one of 
them holds: 

(1) The root of 7(f) cannot be the centroid after any extension 
of 7( f) (the centroid constraint); 

(2) The frequency of some path in 7(f) exceeds the value 
specified by /'(the feature vector constraint); 

(3) The valence of a labeled vertex in 7(f) exceeds the valence of 
the atom (the valence constraint); and 

(4) 7(f) cannot be extended to a tree with the vertex set Kand 
edge set B specified by f (the detachment constraint). 
Whether criterion (4) holds or not can be tested efficiently by 
the algorithms in [51]. 



considering the hardness of treating graphs with one cycle compared 
with tree-like graphs. 

The computational efficiency of the above algorithms relies on 
the result that all vertex-colored trees with at most n vertices in 
constant time per output [58]. As an extension, we have proved that 
all trees with labeled vertices with exactly n vertices (resp., all rooted 
outerplanar vertex-labeled graphs with at most n vertices) can be 
generated in constant time per output [61] (resp., [62]). 
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The first phase obtains a set S^F) of simple trees that satisfy the 
given path frequency ignoring the multiplicity. In the second phase, 
we select each simple tree T^S^f) and generate all tree-like chemical 
graphs that satisfy the given feature vector fhj assigning multiplicity 
on adequate edges in 7T Our algorithm for the second phase is also 
designed based on a branch-and-bound method and the idea of family 
trees. Figure 6 (b) shows part of a family tree for assigning 
multiplicity to simple tree 71, where the parent of each tree-like graph 
G is G^i. 

Note that a feature vector f on path frequency specifies the exact 
number of times each path appears in a graph to be constructed. 
When a vector f is artificially constructed, no graph with such a path 
frequency in /'may exist in many cases. To avoid this, Shimizu et al. 
[59] recently introduced a problem of constructing all tree-like graphs 
which satisfy one in a given set F of feature vectors. A set 7^ of feature 
vectors is specified by a pair of upper and lower vectors fu and fi on 
path frequency such that fu and fi have the same frequency of path of 
length 0 (i.e, both specifies the same set of vertices), and the set 7^ is 
defined to be the set of all vectors /"such that fL<^fu. Shimizu et al. 
[59] successfully designed a two-phase algorithm for handling the 
new problem directly without repeatedly applying the algorithm by 
Ishida et al. [57] to each feature vector ^F. 

Recently Suzuki et al. [60] have developed an algorithm for 
enumerating graphs with at most one cycle (of length at least 3) from 
a set F of feature vectors specified by upper and lower vectors fu and 
f on path frequency. The main idea of this algorithm is to define the 
parent of a graph with exactly one cycle to be a tree-like graph by 
removing an edge in the cycle and to design as the third phase a 
procedure for generating a graph with exactly one cycle from each 
tree-like graph 7" obtained after the second phase of the algorithm by 
Shimizu et al. [59]. Our experimental result reveals that the 
computational efficiency of the new algorithm remains high 



Figure 6. Family trees for enumeration of chemical graphs, (a) A family 
tree for generating simple trees; (b) A family tree for assigning multiplicity 
to simple tree T\^S{f\. 

Enumeration of stereoisomers 

We have developed algorithms for enumerating all stereoisomers 
of a given chemical graph G that admits tree or outerplanar structures 
[63,64]. The algorithms are based on dynamic programming, another 
standard enumerative approach for solving combinatorial problems 
(see [17] for details). For simplicity, we describe the idea only for 
tree-like graphs and the asymmetry around carbon atoms with no 
double bonds incident to them (the asymmetry formed by double 
bonds such as cis-trans type can be treated with a modified argument). 
Thus a carbon atom is asymmetric if the four substructures around it 
are all sterically distinct. Figure 7 (a) illustrates that the three- 
dimensional structure around a carbon atom forms a regular 
tetrahedron, where d^, di, di and d^ represent the directions along the 
four edges incident to the carbon atom. Figure 7 (b) and (c) show 
two configurations around the asymmetric carbon atom in lactic acid. 




(a) (b) (c) 

Figure 7. Example of stereoisomers, (a) The four directions do, di, dj and 

c/3 around a carbon atom in the three-dimensional space; (b), (c) Two 
configurations around the asymmetric carbon atom in lactic acid. 
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Given a tree-like chemical graph G, we regard it as a tree rooted at 
its centroid r (the vertex removal of which leaves no component 
containing more than a half number of the vertices). For each vertex v 
in G, we denote by Tv the subtree induced by v and all its descendants 
in G, and let ^v) denote the number of stereoisomers of Tv, For a 
non-root carbon atom f in G", a stereoisomer of Tv is determined as 
follows. Let ui, m and m be the three children of v in G, where v is 

adjacent to the three subtrees T- = , i— 1,2,3, Note that v is 

adjacent to the fourth subtree T composed of the rest of vertices, 

which is always structurally distinct from every T- = , See the 

carbon atom v in Figure 8 for four subtrees 71, 71, T and 71. 

Suppose that a stereoisomer of 71-, say, the ^-th one T^^"^ , consists of 

the Ai-th stereoisomer T-^^'"^ of T for each /— 1,2,3, where 

1,2,. . .,y(y/)}. Then we have two cases: (a) some two of T^^^^'^ ^ 



7^^^^^ and T^'^^^ are sterically same (i.e., the same stereoisomer); and 

(b) every two of J^^^^^ ^ T^^^^'^ and 7^^^^^^ are sterically distinct. In 
(b), V is asymmetric, and there are exactly two different three- 
dimensional positions of T — {T^^^\T2^^\T^^^'^ } around v, each 
of which we denote by (7=+ or (7=-. In (a) v is symmetric, and we let 
(7=0. Thus, a stereoisomer 7^^^^ of 71 is represented by (t^Ct) (or 
symbolically by an index vector (i:i ,^2,^3,(7)). Let ^f) (resp., ^(f)) 
denote the number of collections T of three stereoisomers of by 

which case (a) occurs (resp., (b) can occur). Then the number of 
stereoisomers of 71 is given as I(^v)—^v)+2h(^v), 



rooted at centroid 




f{v) = giv) + 2h{v) = fiu,)f{u,) + 2 



V 



Similarly if 71, 71 and 71 are all structurally distinct (resp., same), 
then we have ^v)—0 and h(^v)—I(^ui^I(^m)^ui) (resp., g(^v)—i(^ui^i(^ui) 



and h(v) = 



V 



). For example, the carbon atom n in Figure 



8 has no two children which have the structurally same subtree, and 
we have ^fi)=0, ^(fi)=I and y(Fi)=2. Note that ^vz)—! and 
y(p^)=I. For the carbon atom j^in Figure 8, only n and vi among its 
children have the structurally same subtree, and we have 

2 



^f^)=y(Fi)y(F3)=2, h(w) = 



V 



f(v^) = l andy(p^)=4. We 



can regard the ^(pf)— 4 stereoisomers are represented by index vectors 



r^" = (1,1,0) , r;^' = (2,2,0) , r^^' = (1,2,+) , and 

T^"^^ — (1,2,—). By computing ^(f) for all carbon atoms f in a 

bottom-up manner along G, we can finally determine the number Kg 
of all stereoisomers of G, The first phase can be implemented to run 
in (\n) time and space for a tree-like graph or outerplanar graph G 
with n vertices. 

Output phase. For each number /— 1,2,. . .,^1?, the second phase 
outputs the i-th stereoisomer of G. The /-th stereoisomer of G will be 
characterized by choosing an index vector {k\,ki,k2,,0^ for each carbon 
atom in a top-down manner along tree G, When we visit a carbon 

atom F, we are supposed to compute the ^-th stereoisomer T^^^ of 

the subtree 71 for a number ^ { 1,2,. . .,y(F)} which has been 
determined by the process applied to the ancestors of v so far. Our 
task is to compute {k\,ki,h,,0^ from a given number ^ { 1,2,. . .,y(F)}. 
For example, the carbon atom w in Figure 8 has £^w)—A 
stereoisomers, which are supposed to be represented by the index 
vectors in the above. Then if ^=3 at w, then we take 

— (1,2,+) and assign ki—\ and ki—2 to our process for the 

children w and vi, respectively. Based on how ^ f) and ^( f) have been 
computed in the first phase, we can identify the ^-th index vector 
without explicitly constructing a complete ranking table for the £^v) 
index vectors. We repeat this process until the corresponding index 
vector for each carbon atom in G is determined. The second phase 
can be implemented to generate each stereoisomer of a tree-like graph 
(resp., outerplanar graph) G with n vertices in C\n) (resp., C^/r^)) 
time per output using 0{n) space. 



Figure 8. A tree-like chemical graph G rooted at its centroid r. 



Our algorithm consists of two major phases: (i) the first phase 
counts the number Kg of stereoisomers of G by dynamic 
programming; and (ii) for each ^=1,2,. . ,,Kg, the second generates the 
^-th stereoisomer by backtracking the computation in (i). 

Counting phase. The first phase computes y( f) of each vertex v in 
a bottom-up manner along G When we visit a non-root carbon atom 
F, £^ui) of each child w, i— 1,2,3 of Fhas been computed. If 71 and 71 
are structurally same (where y(yi)=y(a2)) and 71 is structurally 
distinct from 71 and 71, then we have ^F)=y(m)y(w3), 

h(v)= fi^s) and 



Summary and outlook 

In this article, we have reviewed graph classes relevant to chemical 
compounds, theoretical results on the computational complexities of 
the isomorphism, normal forms, subgraph isomorphism, and 
maximum common subgraph problems of chemical graphs, and 
algorithms for computation of reaction atom mappings and 
enumeration of chemical graphs and stereoisomers. 

As discussed above, the isomorphism and normal form problems 
can be solved in polynomial time for all chemical graphs and the 
subgraph isomorphism can be solved in polynomial time for almost 
all chemical graphs. However, these polynomial-time algorithms are 
not practical because of their high degrees of polynomials. Therefore, 
algorithms having low-degree polynomial complexities should be 
developed. As for the maximum common subgraph problem, there 
exists a polynomial time algorithm for chemical graphs with 
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outerplanar structures, whereas it is NP-hard for partial ^-trees with 
k—11. Since outerplanar graphs are a subclass of partial 2-trees and 
most chemical graphs are partial 3 -trees, it is interesting to study the 
complexity of the maximum common subgraph problem for partial 2- 
trees and partial 3 -trees. For the reaction atom mapping problem, 
several practical algorithms have been developed although it is NP- 
hard in general. 

From a practical viewpoint, it is not necessary to develop 
polynomial-time algorithms because the size of chemical graphs is 
usually limited. Therefore, development of efficient exponential-time 
algorithms and/ or fixed-parameter algorithms [65] is another 
theoretical approach to these problems. Of course, some of existing 
practical algorithms work very efficiently for most chemical graphs. 
Analysis of the computational complexities of such algorithms might 
also lead to development of faster algorithms. 

In the latter part of this article, we explained algorithms for 
enumeration of chemical graphs and stereoisomers that were 
developed by the authors and their colleagues. These algorithms are 
fast in both theory and practice. They were implemented on the 
EnuMol web server ( http: / /sunflo wer.kuicr.kyoto- 

u.ac.jp/ tools/ enumol2/ ) and are freely available via the web page for 
academic purposes, where the details of EnuMol will be reported 
elsewhere. However, the classes covered by these algorithms are 
limited. Therefore, development of algorithms that cover most 
chemical graphs is important future work. Of course, enumeration 
takes quite a long time for large chemical graphs because the number 
of objects to be enumerated grows exponentially to the number of 
atoms. Therefore, introduction and effective use of adequate 
constraints are important and necessary in future studies. 
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